Composable, Scalable, and Accurate Weight Summarization of Unaggregated Data Sets

Authors

  • Edith Cohen
  • Nick G. Duffield
  • Haim Kaplan
  • Carsten Lund
  • Mikkel Thorup
Abstract

Many data sets occur as unaggregated data sets, where multiple data points are associated with each key. In the aggregate view of the data, the weight of a key is the sum of the weights of data points associated with the key. Examples are measurements of IP packet header streams, distributed data streams produced by events registered by sensor networks, and Web page or multimedia requests to content distribution servers. We aim to combine sampling and aggregation to provide accurate and efficient summaries of the aggregate view. However, data points are scattered in time or across multiple servers and hence aggregation is subject to resource constraints on the size of summaries that can be stored or transmitted. We develop a summarization framework for unaggregated data where summarization is a scalable and composable operator, and as such, can be tailored to meet resource constraints. Our summaries support unbiased estimates of the weight of subpopulations of keys specified using arbitrary selection predicates. While we prove that under such scenarios there is no variance optimal scheme, our estimators have the desirable properties that the variance is progressively closer to the minimum possible when applied to a “more” aggregated data set. An extensive evaluation using synthetic and real data sets shows that our summarization framework outperforms all existing schemes for this fundamental problem, even for the special and well-studied case of data streams.
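
To make the terms in the abstract concrete, the following is a minimal sketch of the aggregate view and of an unbiased subpopulation estimate. It is not the summarization scheme of the paper: it aggregates all data points first (exactly what the resource constraints above generally rule out) and then applies a textbook weighted Poisson sample with inverse-probability adjusted weights. The names `aggregate`, `poisson_sample`, `estimate`, and the threshold `tau` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def aggregate(points):
    """Aggregate view: a key's weight is the sum of its data-point weights."""
    totals = defaultdict(float)
    for key, weight in points:
        totals[key] += weight
    return dict(totals)


def poisson_sample(weights, tau, rng):
    """Keep each key independently with probability p = min(1, w / tau).

    A kept key stores the adjusted weight w / p, so the expected adjusted
    weight of every key equals its true weight (illustrative assumption:
    the full aggregate fits in memory, unlike the setting of the paper).
    """
    sample = {}
    for key, w in weights.items():
        p = min(1.0, w / tau)
        if rng.random() < p:
            sample[key] = w / p
    return sample


def estimate(sample, predicate):
    """Unbiased estimate of the total weight of keys selected by predicate."""
    return sum(adj for key, adj in sample.items() if predicate(key))


if __name__ == "__main__":
    # Hypothetical (key, weight) data points, e.g. (flow id, packet bytes).
    points = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 0.5)]
    weights = aggregate(points)
    sample = poisson_sample(weights, tau=2.0, rng=random.Random(7))
    print(estimate(sample, lambda key: key in {"a", "c"}))
```

A larger `tau` yields a smaller expected sample and a higher-variance estimate; with `tau` no larger than the smallest key weight, every key is kept and the estimate is exact.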

Similar resources

An Expectation-Maximization Algorithm Working on Data Summary

Scalable cluster analysis addresses the problem of processing large data sets with limited resources, e.g., memory and computation time. A data summarization or sampling procedure is an essential step of most scalable algorithms. It forms a compact representation of the data. Based on it, traditional clustering algorithms can process large data sets efficiently. However, there is little work on...

Algorithms and estimators for summarization of unaggregated data streams

Statistical summaries of IP traffic are at the heart of network operation and are used to recover information on arbitrary subpopulations of flows. It is therefore of great importance to collect the most accurate and informative summaries given the router’s resource constraints. IP packet streams consist of multiple interleaving IP flows. While queries are posed over the set of flows, the sum...

Tuple Graph Synopses for Relational Data Sets

This paper introduces the Tuple Graph (TuG) synopses, a new class of data summaries that enable accurate selectivity estimates for complex relational queries. The proposed summarization framework adopts a “semi-structured” view of the relational database, modeling a relational data set as a graph of tuples and join queries as graph traversals respectively. The key idea is to approximate the str...

Knowledge Summarization for Scalable Semantic Data Processing

Scalable semantic data processing has become a crucial issue for practical applications of the Semantic Web. In this paper, we propose an approach of scalable semantic data processing by knowledge summarization. The main idea is to express scalable semantic data on different abstraction and summarization levels to reduce their cardinalities, so that they can be processed efficiently. The notion...

Scalable Model-based Clustering Algorithms for Large Databases and Their Applications

With the unabated growth of data amassed from business, scientific and engineering disciplines, cluster analysis and other data mining functionalities play an increasingly important role. They can reveal previously unknown and potentially useful patterns and relations in large databases. One of the most significant challenges in data mining is scalability — effectively handling large databases...

Journal:
  • PVLDB

Volume 2

Publication date: 2009